Module 3: Review of Basic Data Analytic Methods Using R

Part 3: Statistics for Model Building and Evaluation 


During this lesson the following topics are covered: 

• Statistics in the Analytic Lifecycle

• Hypothesis Testing

• Difference of means

• Significance, Power, Effect Size

• ANOVA

• Confidence Intervals
Statistics in the Analytic Lifecycle

• Model Building and Planning

► Can I predict the outcome with the inputs that I have? 

► Which inputs? 

• Model Evaluation 

► Is the model accurate? 

► Does it perform better than "the obvious guess"?

► Does it perform better than another candidate model? 

• Model Deployment 

► Do my predictions make a difference? 

►► Are we preventing customer churn? 

►► Have we raised profits? 
Hypothesis Testing

Fundamental question: "Is there a difference between the populations, based on samples?"

► Examples: Mean, Variance
Variance: a measure of how far data points lie from the mean

• Data Set 1: 3, 5, 7, 10, 10

• Data Set 2: 7, 7, 7, 7, 7

What are the mean and median of each data set?

Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7

The means and medians match, yet the two data sets are clearly not identical! The variance shows how they differ.

We want a way to represent this difference numerically.
How to Calculate?

We estimate the spread of a distribution as the extent to which the values in the distribution differ from the mean and from each other.
The average of the squared deviations about the mean is called the variance.

The standard deviation s is the square root of the variance.
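The definition above can be checked in R. The following is a minimal sketch, not part of the original slides: the helper name `pop_var` is hypothetical, and it divides by n as the slide definition does. Note that R's built-in `var()` divides by n - 1 instead (the sample variance), so the two results differ.

```r
# Variance as defined on the slide: the mean of the squared
# deviations about the mean (divide by n). pop_var is a hypothetical helper.
pop_var <- function(x) mean((x - mean(x))^2)

set1 <- c(3, 5, 7, 10, 10)
set2 <- c(7, 7, 7, 7, 7)

c(mean = mean(set1), median = median(set1))  # both 7
c(mean = mean(set2), median = median(set2))  # both 7

pop_var(set1)        # 38 / 5 = 7.6
sqrt(pop_var(set1))  # standard deviation of Data Set 1
pop_var(set2)        # 0: every value equals the mean

# Built-in sample variance divides by n - 1 rather than n
var(set1)            # 38 / 4 = 9.5
```

The same mean and median hide very different spreads, which is exactly what the variance exposes.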
Example

Score    x     x - x̄    (x - x̄)²
1        3
2        5
3        7
4        10
5        10
Totals   35

The mean is 35/5 = 7.
Example (continued)

Score    x     x - x̄          (x - x̄)²
1        3     3 - 7 = -4
2        5     5 - 7 = -2
3        7     7 - 7 = 0
4        10    10 - 7 = 3
5        10    10 - 7 = 3
Totals   35
Example (continued)

Score    x     x - x̄          (x - x̄)²
1        3     3 - 7 = -4      16
2        5     5 - 7 = -2      4
3        7     7 - 7 = 0       0
4        10    10 - 7 = 3      9
5        10    10 - 7 = 3      9
Totals   35                    38
Example (continued)

The squared deviations sum to 38, so the variance is 38 / 5 = 7.6 and the standard deviation is √7.6 ≈ 2.76.

Copyright © 2014 EMC Corporation. All Rights Reserved.


Example 2

Dive    Mark    Myrna
1       28      27
2       22      27
3       21      28
4       26      6
5       18      27

Find the mean, median, and range.

          Mark    Myrna
mean      23      23
median    22      27
range     10      22

Which diver was more consistent?
Dive     Mark's score x    x - x̄    (x - x̄)²
1        28                5         25
2        22                -1        1
3        21                -2        4
4        26                3         9
5        18                -5        25
Totals   115               0         64


Mark’s variance = 64 / 5 = 12.8
Myrna’s variance = 362 / 5 = 72.4

Conclusion: Mark has the lower variance, so he is the more consistent diver.
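The diver comparison can be reproduced in a few lines of R. This is a sketch rather than part of the original deck; `pop_var` and `summarise_diver` are hypothetical helper names, and the variance divides by n to match the slide arithmetic.

```r
mark  <- c(28, 22, 21, 26, 18)
myrna <- c(27, 27, 28, 6, 27)

# Variance as used on the slides: mean squared deviation (divide by n)
pop_var <- function(x) mean((x - mean(x))^2)

# Hypothetical helper collecting the statistics asked for in the example
summarise_diver <- function(x) {
  c(mean     = mean(x),
    median   = median(x),
    range    = diff(range(x)),   # max - min
    variance = pop_var(x))
}

rbind(Mark = summarise_diver(mark), Myrna = summarise_diver(myrna))
# Mark:  mean 23, median 22, range 10, variance 12.8
# Myrna: mean 23, median 27, range 22, variance 72.4
```

Both divers share the same mean, so only the spread measures (range and variance) separate them.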
Hypothesis testing is a common technique to assess the difference or significance

Null hypothesis: There is no difference

Alternate hypothesis: There is a difference
The central aim of a statistical test is to determine how likely the value observed in a sample is, under the assumption that the null hypothesis is true

• H0 states that there is no statistically significant difference between your sample and a reference population (or between two samples)

• H1 states the opposite, i.e. that there is a statistically significant difference between your sample and a reference population (or between two samples)
Null and Alternative Hypotheses: Examples

Null hypothesis: The average squared prediction error from the model is the same as the average squared prediction error from the null model.
Alternative hypothesis: The model predicts better than the null model:
• The average squared prediction error from the model is smaller than that of the null model

Null hypothesis: This variable does not affect the outcome:
• The coefficient value is zero
Alternative hypothesis: The variable does affect the outcome:
• The coefficient value is non-zero

Null hypothesis: The model predictions do not improve revenue (income):
• Revenue is the same with or without the intervention
Alternative hypothesis: Interventions based on model predictions improve revenue:
• A/B Testing, ANOVA
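
The second pair of hypotheses (coefficient is zero versus non-zero) is the test that R reports for each term of a linear model. The sketch below is illustrative only: the data are simulated, and the variable names `x`, `noise`, and the true coefficient of 2 are assumptions made for the example.

```r
set.seed(42)                   # arbitrary seed, for reproducibility
n     <- 100
x     <- rnorm(n)              # input that truly affects the outcome
noise <- rnorm(n)              # input unrelated to the outcome
y     <- 2 * x + rnorm(n)      # assumed true relationship

fit   <- lm(y ~ x + noise)
coefs <- summary(fit)$coefficients

# For each term, Pr(>|t|) is the p-value for H0: the coefficient is zero
coefs[, c("Estimate", "Pr(>|t|)")]
```

A small p-value for `x` leads us to reject "the coefficient value is zero" for that input; the p-value for `noise` is usually (though not always) large.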






Intuition: Difference of Means 


[Figure: two distributions with means m1 and m2]






Welch’s t-test

t-statistic:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where $\bar{X}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of sample $i$ (this is the t-statistic for the Welch t-test)

> x = rnorm(10) # distribution centered at 0
> y = rnorm(10,2) # distribution centered at 2
> t.test(x,y)

        Welch Two Sample t-test

data:  x and y
t = -7.2643, df = 15.05, p-value = 2.713e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.364243 -1.291811
sample estimates:
mean of x mean of y
0.5449713 2.3729984




p-value: area under the tails of the appropriate Student's t distribution

If the p-value is small (say < 0.05), then reject the null hypothesis and conclude that m1 ≠ m2:

"m1 and m2 are significantly different"
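To make the p-value concrete, the Welch statistic, its degrees of freedom, and the tail area can be computed by hand and compared with `t.test()`. This is a sketch under the assumption of freshly simulated data (the seed is arbitrary, so the numbers will not match the transcript above); the degrees of freedom use the standard Welch-Satterthwaite approximation.

```r
set.seed(1)                 # arbitrary seed so the run is repeatable
x <- rnorm(10)              # sample centered at 0
y <- rnorm(10, 2)           # sample centered at 2

# Squared standard errors of each sample mean (variances are not pooled)
se2_x <- var(x) / length(x)
se2_y <- var(y) / length(y)

t_manual <- (mean(x) - mean(y)) / sqrt(se2_x + se2_y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_x + se2_y)^2 /
  (se2_x^2 / (length(x) - 1) + se2_y^2 / (length(y) - 1))

# Two-sided p-value: area under both tails of the t distribution
p_manual <- 2 * pt(-abs(t_manual), df_manual)

res <- t.test(x, y)         # Welch is the default (var.equal = FALSE)
c(t = t_manual, df = df_manual, p = p_manual)
```

The manual values should agree with `res$statistic`, `res$parameter`, and `res$p.value`.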
Wilcoxon Rank Sum Test

• t-test assumes that the populations are normally distributed

► Sometimes this is close to true, sometimes not

• Wilcoxon Rank Sum test

► Does not assume that the populations are normally distributed

► More robust test for difference of means

► If the p-value is small: reject the null hypothesis (equal means)


> mean(x) 

[1] 0.5449713 

> mean(y) 

[1] 2.372998 

> wilcox.test(x, y) 

Wilcoxon rank sum test


data: x and y 

W = 2, p-value = 4.33e-05 

alternative hypothesis: true location shift is not equal to 0 
Wilcoxon Signed-Rank Test (paired samples)

Let $N$ be the sample size, i.e., the number of pairs. Thus, there are a total of $2N$ data points. For pairs $i = 1, \ldots, N$, let $x_{1,i}$ and $x_{2,i}$ denote the measurements.

$H_0$: the difference between the pairs follows a symmetric distribution around zero.

$H_1$: the difference between the pairs does not follow a symmetric distribution around zero.

1. For $i = 1, \ldots, N$, calculate $|x_{2,i} - x_{1,i}|$ and $\operatorname{sgn}(x_{2,i} - x_{1,i})$, where sgn is the sign function.

2. Exclude pairs with $|x_{2,i} - x_{1,i}| = 0$. Let $N_r$ be the reduced sample size.

3. Order the remaining $N_r$ pairs from smallest absolute difference to largest absolute difference, $|x_{2,i} - x_{1,i}|$.

4. Rank the pairs, starting with the smallest as 1. Ties receive a rank equal to the average of the ranks they span. Let $R_i$ denote the rank.

5. Calculate the test statistic $W$, the sum of the signed ranks:

$$W = \sum_{i=1}^{N_r} \left[ \operatorname{sgn}(x_{2,i} - x_{1,i}) \cdot R_i \right]$$

The signum function of a real number $x$ is defined as follows:

$$\operatorname{sgn}(x) := \begin{cases} -1 & \text{if } x < 0, \\ 0 & \text{if } x = 0, \\ 1 & \text{if } x > 0. \end{cases}$$
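The five steps above can be sketched directly in R. The paired values below are hypothetical and chosen only for illustration. One caveat: `wilcox.test(..., paired = TRUE)` reports V, the sum of the positive ranks only, rather than the signed sum W; the two are related by W = 2V - N_r(N_r + 1)/2.

```r
# Paired measurements (hypothetical values, for illustration only)
x1 <- c(125, 115, 130, 140, 140, 115, 140, 125, 140, 135)
x2 <- c(110, 122, 125, 120, 140, 124, 123, 137, 135, 145)

d   <- x2 - x1            # step 1: differences (sign and magnitude)
d   <- d[d != 0]          # step 2: drop pairs with zero difference
n_r <- length(d)          # reduced sample size
r   <- rank(abs(d))       # steps 3-4: rank |d|; ties get the average rank
W   <- sum(sign(d) * r)   # step 5: sum of the signed ranks

# Built-in test; exact = FALSE uses the normal approximation, which is
# what R falls back to anyway when ties or zero differences are present
res <- wilcox.test(x2, x1, paired = TRUE, exact = FALSE)
V   <- unname(res$statistic)   # sum of the positive ranks

c(W = W, V = V)                # W = -9, V = 18
```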
Hypothesis Testing: Summary

• Calculate the test statistic

► Different hypothesis tests are appropriate in different situations

• Calculate the p-value on the test statistic

• If the p-value is "small", then reject the null hypothesis

► "small" is often p < 0.05 by convention (95% confidence)

► Many data scientists prefer a smaller threshold, often 0.01 or 0.001.





Generating a Hypothesis: Type I and Type II Error

If H0 is X, and we ...                    H0 is true              H0 is false
Reject the null hypothesis                Type I error            Correct outcome
(we claim something happened)             False positive (α)      True positive
Fail to reject the null hypothesis        Correct outcome         Type II error
(we claim nothing happened)               True negative           False negative

Example: Ham or Spam? H0: it’s Ham; HA: it’s Spam

                       It’s really Ham             It’s really Spam
We say it’s Spam       Type I - false positive     OK - true positive
We say it’s Ham        OK - true negative          Type II - false negative

Goal: Identify Spam
Which error is worse?
Significance, Power and Effect Size

• Significance: the probability of a false positive (α)

► The p-value is the observed significance; reject H0 when it falls below α

• Power: the probability of a true positive (1 − β)

• Effect size: the size of the observed difference

► The actual difference in means, for example
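
Significance and power can both be estimated by simulation: repeat an experiment many times and count how often the test rejects H0. The sketch below is an illustration, not part of the original deck; the sample size, the true difference `delta`, the number of repetitions, and the helper name `reject_rate` are all assumptions.

```r
set.seed(123)          # arbitrary seed
alpha <- 0.05          # significance threshold
n     <- 20            # assumed sample size per group
delta <- 1             # assumed true difference in means when H0 is false
reps  <- 2000          # number of simulated experiments

# Fraction of simulated experiments in which the t-test rejects H0
reject_rate <- function(true_diff) {
  mean(replicate(reps,
                 t.test(rnorm(n), rnorm(n, true_diff))$p.value < alpha))
}

false_positive_rate <- reject_rate(0)      # H0 true: estimates alpha
estimated_power     <- reject_rate(delta)  # H0 false: estimates 1 - beta

# Analytic power of a two-sample t-test, for comparison
analytic_power <- power.t.test(n = n, delta = delta, sd = 1,
                               sig.level = alpha)$power

c(false_positive_rate = false_positive_rate,
  estimated_power     = estimated_power,
  analytic_power      = analytic_power)
```

The false positive rate should land near 0.05, and the simulated power near the analytic value, up to simulation noise.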





Always Keep Effect Size in Mind!

With larger sample sizes, power increases and ever smaller differences become statistically significant.

So you can observe an effect size that is statistically significant, but practically insignificant!
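
A short simulation illustrates the point. In this sketch the true difference is assumed to be only 0.01 standard deviations, which would rarely matter in practice; the sample sizes and seed are likewise arbitrary choices for the example.

```r
set.seed(7)            # arbitrary seed
tiny_effect <- 0.01    # assumed true difference: 1% of a standard deviation

# Power of a two-sample t-test to detect the tiny effect at several sizes
sizes  <- c(100, 10000, 1000000)
powers <- sapply(sizes, function(n) {
  power.t.test(n = n, delta = tiny_effect, sd = 1, sig.level = 0.05)$power
})
data.frame(n_per_group = sizes, power = round(powers, 3))

# One simulated experiment at the largest sample size
a   <- rnorm(1000000)
b   <- rnorm(1000000, mean = tiny_effect)
res <- t.test(a, b)

res$p.value          # typically far below 0.05: statistically significant
diff(res$estimate)   # yet the observed difference in means is only about 0.01
```

Reporting the effect size next to the p-value keeps a result like this in perspective.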
Hypothesis Testing: ANOVA

• ANOVA is a generalization of the difference of means

• One-way ANOVA

► k populations ("treatment groups")

► n_i samples each, total N subjects

► Null hypothesis: ALL the population means are equal

Population         n_i: # offers made    m_i: avg purchase size
Offer 1            100                   $55
Offer 2            102                   $50
No intervention    99                    $25
ANOVA: Understanding the F statistic

$s_B^2$: how the population means vary with respect to the total mean $m_0$

$$s_B^2 = \frac{1}{k-1} \sum_{i=1}^{k} n_i \, (m_i - m_0)^2$$

$s_W^2$: the "average" of the population variances

$$s_W^2 = \frac{1}{N-k} \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - m_i)^2$$

Test statistic: $F = s_B^2 / s_W^2$
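
The F statistic can be computed by hand and checked against `lm()`, which the lab uses for ANOVA. This is a sketch on simulated data: the group sizes and means come from the offers table, while the within-group standard deviation `sd_assumed`, the group labels, and the seed are assumptions, so the numbers are illustrative only.

```r
set.seed(2014)                             # arbitrary seed
group_names <- c("offer1", "offer2", "nointervention")
n_i  <- c(100, 102, 99)                    # offers made (from the table)
mu_i <- c(55, 50, 25)                      # average purchase size (from the table)
sd_assumed <- 20                           # hypothetical within-group spread

offers <- data.frame(
  offer    = factor(rep(group_names, times = n_i), levels = group_names),
  purchase = rnorm(sum(n_i), mean = rep(mu_i, times = n_i), sd = sd_assumed)
)

k  <- nlevels(offers$offer)                # number of populations
N  <- nrow(offers)                         # total number of subjects
m0 <- mean(offers$purchase)                # total mean
m  <- tapply(offers$purchase, offers$offer, mean)     # group means
n  <- tapply(offers$purchase, offers$offer, length)   # group sizes

# Between-group and within-group variances
s_b2 <- sum(n * (m - m0)^2) / (k - 1)
s_w2 <- sum((offers$purchase - ave(offers$purchase, offers$offer))^2) / (N - k)

f_manual <- s_b2 / s_w2
p_val    <- pf(f_manual, k - 1, N - k, lower.tail = FALSE)

# Same test via lm(); the first row of the ANOVA table is the offer term
fit  <- lm(purchase ~ offer, data = offers)
f_lm <- anova(fit)[["F value"]][1]

c(F_manual = f_manual, F_lm = f_lm, p = p_val)
```

A small p-value rejects "all the population means are equal" but does not say which groups differ; the lab follows up with Tukey's test for that.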
Confidence Intervals

Example:

• Normal data $N(\mu, \sigma)$

• $\bar{x}$ is the estimate of $\mu$

• based on $n$ samples

$\mu$ falls in the interval

$$\bar{x} \pm 2\sigma/\sqrt{n}$$

with approx. 95% probability ("95% confidence")

If $\bar{x}$ is your estimate of some unknown value $\mu$, the P% confidence interval is the interval around $\bar{x}$ that $\mu$ will fall in, with probability P.
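
The rule of thumb above is easy to try in R. In this sketch the true mean `mu`, the known standard deviation `sigma`, the sample size, and the seed are all assumed values chosen for the simulation.

```r
set.seed(10)             # arbitrary seed
mu    <- 50              # "unknown" true mean, assumed for the simulation
sigma <- 8               # known standard deviation, assumed
n     <- 100

x    <- rnorm(n, mean = mu, sd = sigma)
xbar <- mean(x)

# Rule of thumb from the slide: xbar +/- 2 * sigma / sqrt(n)
ci_rule  <- xbar + c(-2, 2) * sigma / sqrt(n)

# Same interval with the exact normal quantile (about 1.96) in place of 2
ci_exact <- xbar + c(-1, 1) * qnorm(0.975) * sigma / sqrt(n)

# Coverage check: how often does the rule-of-thumb interval contain mu?
covers <- replicate(5000, {
  xb <- mean(rnorm(n, mu, sigma))
  abs(xb - mu) <= 2 * sigma / sqrt(n)
})
mean(covers)             # expected to be close to 0.95
```

Across repeated samples, roughly 95% of the intervals built this way contain the true mean, which is the sense in which the interval carries "95% confidence".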
Example

The defect rate of a disk drive manufacturing process is within 0.9% - 1.7%, with 98% confidence. We inspect a sample of 1000 drives from one of our plants.

• We observe 13 defects in our sample.

• Should we inspect the plant for problems?

• What if we observe 25 defects in the sample?
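
One way to reason about the two scenarios is to compare each observed defect rate with the stated interval. The sketch below does only that; the final `binom.test()` call is an optional follow-up that is not part of the original slide, and it treats the upper bound of 1.7% as the hypothesized true rate.

```r
ci_low      <- 0.009   # lower bound of the 98% interval (from the slide)
ci_high     <- 0.017   # upper bound of the 98% interval (from the slide)
n_inspected <- 1000

defects <- c(13, 25)
rate    <- defects / n_inspected           # 0.013 and 0.025
within  <- rate >= ci_low & rate <= ci_high

data.frame(defects, rate, within)          # 13 is inside, 25 is outside

# Optional follow-up (an addition, not from the slide): exact binomial
# test of whether 25 defects is plausible if the true rate were 1.7%
binom.test(25, n_inspected, p = ci_high, alternative = "greater")$p.value
```

With 13 defects the plant is consistent with the process as a whole; 25 defects falls outside the interval and suggests an inspection is worthwhile.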
Check Your Knowledge

Refer back to the ANOVA example on an earlier slide. What do you think? Does the difference between offer 1 and offer 2 make a practical difference? Should we go ahead and implement one of them?

If yes, and the costs were US $25 for offer 1 and US $10 for offer 2, would you still make the same decision?

In our manufacturing plant example, assuming you would check the plant for problems in the manufacturing process, how might you justify this decision financially?
Module 3: Review of Basic Data Analytic Methods Using R

Part 3: Summary

During this lesson the following topics were covered: 

• The role of Statistics in the Analytic Lifecycle 

• Developing a model and generating the null and the alternative 
hypothesis 

• Difference between means 

• Difference between significance, power and effect size, and how they 
relate to Type I and Type II errors 

• Applying ANOVA and determining whether the results are significant 

• Defining confidence intervals and applying them 
Lab Exercise 3: Basic Statistics, Visualization and Hypothesis Tests

This lab is designed to investigate and practice using R to perform basic statistics and visualization on data and to perform hypothesis testing.

• After completing the tasks in this lab you should be able to:

• Perform basic data analysis 

• Visualize data with R 


• Create and test a hypothesis 
Lab Exercise 3: Basic Statistics, Visualization and Hypothesis Tests - Part 1 - Workflow



• Prepare working environment for the Lab and load data files 


• Obtain summary statistics for Household Income and visualize data 


• Obtain summary statistics for number of rooms and visualize data 


• Remove Outliers 


• Stratify Variable - Household Income and plot the results 


• Plot Histogram and Distributions 


• Compute Correlation between income and number of rooms 


• Create a Boxplot - Distribution of income as a factor of number of rooms 


• Exit R 
Lab Exercise 3: Basic Statistics, Visualization and Hypothesis Tests - Part 2 - Workflow


• Define problem - Analysis of Variance (ANOVA) 

• Generate the Data 

• Examine the Data 

• Plot and determine how purchase size varies within the three groups 

• Use lm() to do the ANOVA

• Use Tukey’s test to check all the differences of means 

• Use the lattice package for density plot 

• Plot the Logarithms of the Data 

• Use ggplot() from the ggplot2 package

• Generate the example data to perform a Hypothesis Test with manual calculations 

• Create a function to calculate the pooled variance, which is used in the Student’s t statistic 

• Examine the Data 

• Calculate the t statistic for Student's t-test 

• Calculate the degrees of freedom 

• Compute the area under the curve 
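
The last five steps of the workflow (pooled variance, Student's t statistic, degrees of freedom, area under the curve) might look like the sketch below. It is not the lab's actual solution: the example data, the seed, and the function name `pooled_variance` are assumptions made for illustration.

```r
set.seed(99)                     # arbitrary seed
x <- rnorm(10)                   # example data, assumed for illustration
y <- rnorm(20, mean = 1)

# Pooled variance: the two sample variances weighted by their
# degrees of freedom (assumes both populations share one variance)
pooled_variance <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
}

s2_pooled <- pooled_variance(x, y)

# Student's t statistic and its degrees of freedom
t_stat <- (mean(x) - mean(y)) /
  sqrt(s2_pooled * (1 / length(x) + 1 / length(y)))
df <- length(x) + length(y) - 2

# Area under both tails of the t distribution beyond |t|
p_value <- 2 * pt(-abs(t_stat), df)

# Built-in Student's t-test (equal variances) for comparison
check <- t.test(x, y, var.equal = TRUE)
c(t = t_stat, df = df, p = p_value)
```

Unlike Welch's test, Student's t-test pools the two variances, so `var.equal = TRUE` is needed for `t.test()` to reproduce the manual calculation.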